ABSTRACT


The field of transparent Machine Learning (ML) has contributed 
many novel methods aiming at better interpretability for computer 
vision and ML models in general. But how useful the explanations 
provided by transparent ML methods are for humans remains difficult
to assess. Most studies evaluate interpretability in qualitative
comparisons, use experimental paradigms that do not allow
for direct comparisons amongst methods, or report only offline
experiments with no humans in the loop. While there are clear
advantages of evaluations with no humans in the loop, such as scal- 
ability, reproducibility and less algorithmic bias than with humans 
in the loop, these metrics are limited in their usefulness if we do not 
understand how they relate to other metrics that take human cog- 
nition into account. Here we investigate the quality of interpretable 
computer vision algorithms using techniques from psychophysics. 
In crowdsourced annotation tasks we study the impact of differ- 
ent interpretability approaches on annotation accuracy and task 
time. In order to relate these findings to quality measures for inter- 
pretability without humans in the loop we compare quality metrics 
with and without humans in the loop. Our results demonstrate that 
psychophysical experiments allow for robust quality assessment of 
transparency in machine learning. Interestingly the quality metrics 
computed without humans in the loop did not provide a consistent 
ranking of interpretability methods, nor were they representative
of how useful an explanation was for humans. These findings
highlight the potential of methods from classical psychophysics for 
modern machine learning applications. We hope that our results 
provide convincing arguments for evaluating interpretability in its 
natural habitat, human-ML interaction, if the goal is to obtain an 
authentic assessment of interpretability. 
1 INTRODUCTION


In recent years complex machine learning (ML) models, many based 
on deep learning, have achieved surprising results in computer vi- 
sion, natural language processing and many other domains. These 
models are difficult to interpret, which inspired many researchers 
to investigate ways to render ML models interpretable [6, 13, 15, 19]. 
There are many motivations for interpretable ML methods. Domain
experts, data scientists or data engineers who control the proper
functioning of an ML pipeline need to be able to access the
rules learned by an ML system in an intuitive manner in order to
quickly spot the root causes of errors. More generally the main
motivation for research on transparent ML is that intuitive human
understanding of ML predictions is a prerequisite for a healthy
trust relationship between humans and assistive ML systems. In
particular transparency is argued to prevent algorithm aversion as 
well as algorithmic bias. Algorithm aversion refers to cases when 
humans do not trust ML systems, even when they know that the 
model predictions are more accurate than those of a human [5]; algorithmic
bias refers to cases of ethnic or gender biases in ML predictions
[10]. In the following we will also use the term algorithmic bias to
refer to cases of too much trust in an ML prediction, for instance
when a human interacting with assistive ML technology blindly 
follows its predictions. The usual narrative is that explanations of 
ML decisions can increase human trust in them [25, 30]. 

A central problem with interpretability methods is that they are 
difficult to compare and evaluate. Most of the research compares 
methods using either proxy measures, that do not directly relate 
to interpretability by humans, as e.g. [27], or qualitative measures 
that render comparisons of results across studies difficult [32]. In 
this work we propose to use psychophysical methods to quantify 
and compare the quality of interpretability methods. We follow the 
ideas of [28] and base our approach on the assumption that the 
definition of interpretability is inherently tied to a human observer. 
Good interpretability methods should allow human observers to 
intuitively understand a ML prediction. Intuitive understanding of 
the rules learned by a ML system is reflected in how accurately 
and how fast humans make decisions when assisted with a trans- 
parent ML prediction. These two variables can be easily measured 
in psychophysical experiments that study the interaction between 
humans and ML systems. 

The motivation for this work is twofold. First, this work aims
at complementing previous work on measuring the quality of in- 
terpretability methods by establishing a quantitative measure of 
interpretability in the domain of computer vision that captures 
aspects of human cognition. Ultimately this will help practitioners 
to choose the right interpretability method for a given use case 
and researchers to devise novel objectives for better interpretability
methods. Secondly the goal of this study is to validate to what
extent existing approaches for measuring interpretability without 
humans in the loop reflect the interpretability metrics we measure 
in psychophysical experiments. 

In the following we briefly highlight some of the related work
and then describe an image annotation task, emotion recognition, 
as well as the ML model, the transparency approaches used and the 
experimental design for quantitatively evaluating interpretability 
with humans in the loop (HIL) and with no humans in the loop 
(NHIL). We compare the different interpretability approaches with 
respect to the HIL and NHIL metrics and analyze their relationship, 
in particular whether cheaper and more scalable machine based 
NHIL transparency metrics reflect the most relevant but more ex- 
pensive HIL transparency metrics. We conclude with highlighting 
the implications of our results for practitioners that build systems 
with human-ML interaction or transparent ML. 


2 RELATED WORK 


While the literature on the evaluation of transparent ML is very diverse
[19], there appears to be a consensus in the literature that model 
explanations should overlap with human intuitions and that there 
is a lack of quantitative evaluation standards [6, 20, 23]. The tech- 
nical contributions to the field of transparent ML can be broadly 
categorized into two types of methods. First there are methods 
that aim at rendering specific models interpretable, such as inter- 
pretability methods for linear models [12] or interpretability for 
neural network models [24, 29, 33]. Second there are interpretability
approaches that aim at rendering any model interpretable; a
popular example is Local Interpretable Model-Agnostic Explanations
(LIME) [25]. As these latter interpretability methods do not
need to have access to the inner workings of a ML model, they 
are often referred to as black box interpretability methods. One of 
the challenges with most interpretability approaches is that it is 
difficult to evaluate how interpretable to humans a model predic- 
tion becomes when employing a given interpretability method. The 
most straightforward approach to evaluation of interpretability is 
to generate synthetic data from a known generative model and 
evaluate the explanations against the true data generation process 
[12, 34]. However it can be very challenging to design generative 
models for real data. 

In the field of computer vision there have been a number of 
interpretability approaches specialized for that application scenario 
and the method of choice in this field, deep neural networks. Some 
prominent examples include layerwise relevance propagation (LRP) 
[18], sensitivity analysis [29] and deconvolutions [33]. To compare
these approaches, the authors of [27] propose a greedy
iterative perturbation procedure. The idea is to remove features where
the perturbation probability is proportional to the relevance score 
of each feature given by the respective interpretability method. The 
idea of using perturbations underlies also many other interpretabil- 
ity approaches, such as the work on influence functions [2, 11, 16] 
and methods based on game theoretic insights [22, 32]. 

While there are comprehensive surveys on this matter [9], the 
evaluation criteria are often problematic and in many cases do not 
allow a direct comparison of methods. Most attempts to evaluate 
interpretability methods either rely on proxy measures that are not
related directly to interpretability, such as runtime or robustness of 
the interpretability model under perturbations or they merely use 
qualitative measures, as in e.g. [32]. Reflecting the intuition that 
interpretability cannot be evaluated without taking a human in 
the loop there is increasing interest in investigating the quality of 
transparent ML methods in psychological experiments on human- 
machine interaction [14, 17, 22, 26, 28]. 

Building on these results we here employ psychophysical exper- 
iments in a crowdsourcing scenario in order to evaluate the quality 
of interpretability methods. This quality measure is closely related 
to the approach taken in previous work [17, 28] but we here focus 
on the computer vision domain, which requires specific experimen- 
tal designs as visual cognition is very different from cognition of 
text and semantics. For instance visual cognition is characterized 
by much faster processing speed compared to text understanding 
as done in [17, 28]. One aspect that is, to the best of our knowl- 
edge, underrepresented in the field of transparent computer vision 
algorithms is a comprehensive comparison between human in the 
loop metrics and more efficient machine based metrics. Without 
an in depth understanding of how machine based metrics relate to 
metrics that capture human cognition, it is difficult to assess the 
true quality of an interpretability method that was evaluated with 
machine based metrics only. 


3 EXPERIMENTS 


In the following we describe the annotation task and the technical 
prerequisites of our experiments, including the ML model used and 
the transparency approaches applied to it. We then explain the 
experimental paradigm for both interpretability evaluation with 
psychophysical experiments as well as the more commonly used 
evaluation with no humans in the loop. 


3.1 Annotation Task 


The annotation task was emotional expression classification on 
images. We used the extended Cohn-Kanade image data set [21] 
which contains images for the classes anger, contempt, disgust,
fear, happiness, sadness and surprise. The class distribution can be seen
in Table 1. We reduced the data to a binary classification task in 
which annotators had to classify emotional expressions of anger 
and happiness. Some sample images are shown in Figure 1. The 
annotators had the option of not providing an annotation in case 
they did not recognize the emotional expression. We chose this 
data set over other standard benchmark tasks in the domain of 
computer vision as it did not involve the localization of the target 
object but rather the detection of a complex pattern in human 
faces. When applying interpretability methods to models applied to 
other benchmarks, like ImageNet [4], the explanations computed 
often focus on localization of the target object. This effect can be 
considered a convenient proxy for determining whether the model 
has learned the right features; for instance if the model explanation 
correctly localizes the target object, this is better than when the 
model explanation focuses on features that are not the target object 
but just correlate with its appearance in the training data set. An 
example of the latter would be a model explanation that highlights 
the basketball court when it should focus on the basketball to predict
the target object basketball. This form of overfitting is not unusual in
models trained on common benchmark data sets and can be detected
with interpretability methods. But we felt that for our purposes this
is a confounding factor when we are interested in interpretability
quality; we hence opted for the emotional expression task which
did not require localization of the target object.


A psychophysics approach for quantitative comparison of interpretable computer vision models 4


class        number of images
anger        45
contempt     18
disgust      59
fear         25
happiness    69
sadness      28
surprise     83


Table 1: Class distribution of extended Cohn-Kanade (CK+)
data set. We binarized the data set by extracting only images
from the anger and happiness class.


emotion        precision    recall    f1-score    support
anger          0.48         0.47      0.47        45
happiness      0.66         0.67      0.66        69

avg / total    0.59         0.59      0.59        114


Table 2: Held-out per label precision/recall/f1 scores of
EmoPy used for comparing ML interpretability methods on
the CK+ dataset


3.2 Machine Learning Model 


For our experiments we used a computer vision model from an open 
source Python toolkit that achieves state of the art performance
on emotional expression prediction from images [7]. We used the 
default model without any modifications. The precision, recall, and 
F1 scores on the data set used in our experiments are shown in 
Table 2. Note that these predictive performances are not perfect, 
but they can be considered competitive with the state of the art for 
this particular classification task. We also emphasize that we here 
are focussing on interpretability methods, not the underlying ML 
model. The quality of the machine learning model was the same 
for all interpretability methods. 


3.3 Interpretability Methods 


We compared three different interpretability methods for the EmoPy
computer vision model:


• Gradient: the gradient of the output w.r.t. the input image
• Layerwise relevance propagation (LRP): attributes importance
recursively to the inputs of each neuron [18]
• Guided backpropagation: applies ReLU in gradient computation
in addition to the gradient of a ReLU [31]
For all methods we used the implementation in the iNNvestigate! 
package [1]. All methods were used with their default hyperparameters.
For the LRP approach we used the sequential_preset_a
variant provided in the package. The list of methods is not meant
to be exhaustive. The main purpose of this work is to illustrate that
the combination of psychophysical methods and ML can be helpful
for quantifying the usefulness of interpretability methods. For the
sake of simplicity, we deliberately restricted the set of interpretability
methods to just three methods that other experts in the field
recommended to us as useful.


Figure 1: Examples of masked images with happy emotional
expression (top three rows) and angry expression (bottom
three rows) for different interpretability methods (gradient,
guided-backprop [31] and LRP [18], shown in rows) and
mask sizes (shown in columns)


3.4 Quality of Explanations 


We employ two different metrics to compare the quality of inter- 
pretability approaches, one standard approach similar to the com- 
monly used quality metrics with no humans in the loop (NHIL) and 
one approach based on psychophysical experiments with humans 
in the loop (HIL). In both settings we use all three interpretabil- 
ity methods to compute scores for each pixel in the image. These 
scores roughly speaking capture the importance of that pixel for 
the model’s prediction. Based on these scores we rank the pixels 
and mask a certain percentage of pixels. The percentages of shown 
pixels were ten logarithmically spaced values between 0 and 100 to 


account for the Weber-Fechner law postulating a logarithmic rela- 
tionship between stimulus and perception [8]. The masks showed 
5, 6, 8, 11, 15, 19, 26, 34, 45, 60 percent of pixels of the image. Some 
example images for the emotional expression anger and happiness 
are shown in Figure 1. These thresholds were based on initial ex- 
periments with different thresholds in which we determined the 
minimum number of pixels needed to detect the emotion and the 
number of pixels needed to enable most subjects to correctly classify 
the image. 
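The masking step described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the fill value for hidden pixels and the convention that a higher score means a more important pixel are assumptions. The endpoints 5 and 60 and the truncation to integers are likewise assumptions, chosen because they reproduce the ten percentages listed in the text.

```python
def mask_percentages(low=5.0, high=60.0, steps=10):
    """Logarithmically spaced percentages of shown pixels, truncated to integers."""
    ratio = high / low
    return [int(low * ratio ** (i / (steps - 1))) for i in range(steps)]


def mask_image(image, scores, percent_shown, fill=0):
    """Show the highest-scoring pixels and replace all others with `fill`.

    `image` and `scores` are nested lists of equal shape; `scores` holds the
    per-pixel relevance returned by an interpretability method (assumed here
    to be larger for more important pixels).
    """
    height, width = len(image), len(image[0])
    # Rank pixel coordinates by relevance score, most relevant first.
    ranked = sorted(
        ((r, c) for r in range(height) for c in range(width)),
        key=lambda rc: scores[rc[0]][rc[1]],
        reverse=True,
    )
    n_shown = round(height * width * percent_shown / 100)
    shown = set(ranked[:n_shown])
    return [
        [image[r][c] if (r, c) in shown else fill for c in range(width)]
        for r in range(height)
    ]
```

Calling mask_image once per entry of mask_percentages() yields the ten stimuli of one trial for a given image and interpretability method.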


No humans in the loop (NHIL) metrics. When developing a new 
interpretability approach it is most convenient for researchers to 
iterate quickly on model improvements and to validate the improve- 
ments with tests that are ideally fast and can be conducted without 
humans in the loop. Most of these NHIL metrics perturb the input 
data in some way that takes into account the feature scores pro- 
vided by an interpretability method. For instance in [27] the authors 
replace small patches in an input image with noise and evaluate the 
predictive performance for each perturbation of the data. We follow 
this idea and slightly modify the perturbations to match the condi- 
tions used in the psychophysics experiments. In particular we mask 
a certain percentage of pixels and feed the masked image to the 
convolutional neural network to obtain a prediction. To evaluate 
the interpretability quality we evaluate the predictive performance 
of the EmoPy model on masked images. 
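A minimal sketch of this NHIL evaluation is given below. The record layout and the predict callable are hypothetical stand-ins: predict represents whatever function maps a masked image to a class label and is not the EmoPy API.

```python
from collections import defaultdict


def nhil_accuracy(records, predict):
    """Accuracy of model predictions on masked images, per experimental condition.

    records: iterable of (method, percent_shown, masked_image, true_label).
    predict: callable mapping a masked image to a predicted label
             (a hypothetical stand-in for the trained model).
    Returns a dict {(method, percent_shown): accuracy}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for method, percent, masked, label in records:
        key = (method, percent)
        totals[key] += 1
        hits[key] += int(predict(masked) == label)
    return {key: hits[key] / totals[key] for key in totals}
```

Grouping by interpretability method and mask size mirrors the conditions of the psychophysical experiments, so that both sets of metrics can later be compared condition by condition.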


Psychophysical human in the loop (HIL) metrics. In order to quan- 
tify the quality of interpretability methods in HIL psychophysical 
experiments we adopt the ideas from [28]:

(1) Interpretability is associated with intuitive understanding 

(2) Intuitive understanding leads to fast and accurate decisions 


Accuracy and speed of AI-assisted decisions can give insights into
the cognitive load inherent to understanding of ML predictions. 
When an explanation is intuitive we will follow it without too much 
thinking; but when we need more time to digest an explanation, its 
relative interpretability quality is lower compared to other explana- 
tions. More importantly, when ML assisted decisions are followed 
quickly even in cases when the ML predictions were wrong, this is 
a clear sign of unhealthy algorithmic bias. Evaluating both, reaction 
time and accuracy of annotations, can thus provide authentic and 
quantifiable metrics of interpretability quality. Based on these ideas 
we measured the annotation accuracy as well as reaction times in 
the above emotional expression classification task. In the experiments
we systematically controlled the number of pixels unmasked
by a given interpretability method to investigate the relationship
between signal strength and interpretability.
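The two HIL metrics can be aggregated from raw trial records as sketched below. The field names and the response code for the I don't know button are hypothetical, and counting an I don't know response as an incorrect annotation is an assumption of this sketch, not something the text specifies.

```python
from collections import defaultdict
from statistics import mean

DONT_KNOW = "dont_know"  # hypothetical code for the "I don't know" button


def hil_metrics(trials):
    """Aggregate annotation accuracy and reaction time per condition.

    trials: iterable of dicts with the (assumed) keys
        method, percent, response, label, rt_ms.
    Returns {(method, percent): {"accuracy", "dont_know_rate", "mean_rt_ms"}}.
    """
    grouped = defaultdict(list)
    for trial in trials:
        grouped[(trial["method"], trial["percent"])].append(trial)

    metrics = {}
    for key, group in grouped.items():
        metrics[key] = {
            # A "don't know" response never equals the true label,
            # so it is counted as incorrect here (assumption).
            "accuracy": mean(int(t["response"] == t["label"]) for t in group),
            "dont_know_rate": mean(int(t["response"] == DONT_KNOW) for t in group),
            "mean_rt_ms": mean(t["rt_ms"] for t in group),
        }
    return metrics
```

Reporting the rate of I don't know responses separately keeps annotator uncertainty distinguishable from wrong annotations.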


3.5 User Interface and Experimental Design 


We built the user interface using the open source library jsPsych 
[3]. The library provides basic features to design a psychological ex- 
periment running in the browser. In our case, we used the package 
to build an experiment timeline that showed the image stimulus 
with HTML buttons below to capture the annotation provided by
the experimental subjects. In each trial of an experiment we show 
the same image with increasing percentages of pixels shown. As 
we used ten different mask sizes from 5% to 60% of all pixels in 
the image, subjects saw a series of ten images. For illustration we
show a subset of those ten masks for each interpretability method
in Figure 1. At the last image, when 60% of the image was shown,
all subjects correctly identified the emotional expression, see also
Figure 6. The entire experiment was designed to be completed in
about 10 minutes, based on a pilot experiment. For each label five
images were shown, which resulted in 5 (images) x 2 (classes) x
3 (interpretability methods) x 10 (mask sizes) = 300 images in total
that were annotated by each subject. The image stimulus is displayed
in the format as shown in Figure 2. For each subject the
order of the trials was randomized, so each subject has seen each
interpretability method and source image in a random order, but
the order of unmasking the image was always the same. In total 62
subjects participated in the experiment.


[Figure 2 screenshot: prompt "Which emotion is shown in the image?" with the buttons Happy, Angry and I don't know]

Figure 2: User interface of a stimulus shown in the experiment.
The image stimulus for a given mask size is located
in the center of the page. Below the image buttons for annotating
the image with the emotional expression anger and
happiness are shown, along with an I don’t know option.

The experiments were conducted on the crowdsourcing platform 
Amazon Mechanical Turk. We paid all subjects the minimum
wage in the country of the research institution of the authors,
US$11 per hour. Mechanical Turk requires showing a preview of
the experiment before the worker agrees to participate. For the
preview, we provided an instruction and an example trial. In the
main part of the experiment, after the workers agreed to participate,
they were first shown the number of trials they needed to complete.
When they proceeded, the actual experiment started and was
completed after the subjects had annotated 300 images.
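The design arithmetic and the randomization scheme can be illustrated with a short sketch. The experiment itself was implemented in jsPsych; the Python below only mirrors the structure described in the text, and all names in it are hypothetical.

```python
import itertools
import random

MASK_PERCENTAGES = [5, 6, 8, 11, 15, 19, 26, 34, 45, 60]
METHODS = ["gradient", "lrp", "guided_backprop"]
CLASSES = ["anger", "happiness"]
IMAGES_PER_CLASS = 5


def build_schedule(seed=None):
    """Return one subject's schedule as a list of trials.

    Each trial is a list of ten stimuli (method, label, image_index, percent)
    with the percentage of shown pixels always increasing. Only the order of
    the trials is shuffled.
    """
    rng = random.Random(seed)
    trials = list(itertools.product(METHODS, CLASSES, range(IMAGES_PER_CLASS)))
    rng.shuffle(trials)  # random order of method and source image
    return [
        [(method, label, index, percent) for percent in MASK_PERCENTAGES]
        for method, label, index in trials
    ]
```

With these settings a schedule contains 3 x 2 x 5 = 30 trials of ten stimuli each, that is the 300 annotated images per subject stated above.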


4 RESULTS 


In the following we first analyse the results of the psychophysical 
experiments and the results from the experiments without humans 
in the loop independently. Then we compare the interpretability 
metrics from both approaches. Lastly we also investigate the impact
of different transparency approaches on negative forms of
algorithmic bias.
Figure 3: Annotators’ confidence, measured by counting how
often they did not provide a label but the I don’t know la- 
bel, as function of the mask size, aggregated over all inter- 
pretability methods. When 6% of all pixels were shown, 663 
annotators detected an emotion and provided a label, while 
1197 annotators did not detect an emotion and provided only 
the I don’t know label. When 60% of pixels were shown, all 
annotators detected the emotion. 
Figure 4: Annotators’ confidence, measured by counting how
often they did not provide a label but the I don’t know label,
as function of the mask size, for each interpretability method.
When only 6% to 15% of all pixels are shown, Guided
BackProp assisted annotators are almost twice as certain as 
annotators assisted by Gradient explanations and provide 
annotations instead of the I don’t know label. 


4.1 Psychophysical Experiments 


Annotators’ uncertainty and interpretability. We investigated the 
impact of each interpretability approach on the uncertainty of an- 
notators by counting how often they did not provide an annotation 
but just the I don’t know label. Averaging across all interpretabil- 
ity approaches we see in Figure 3 that the experimental settings 
were chosen such that there is a smooth increase in annotators’ 
confidence when increasing the percentage of pixels of an image. 
Splitting the data into the different interpretability conditions, there
is a clear effect of the interpretability method as shown in Figure 4.
The simplest gradient approach leads to the highest annotator uncertainty
and the lowest number of annotations up to mask sizes of 45%
of all pixels. Guided BackProp [31] in contrast leads consistently to
the lowest annotator uncertainty and the highest number of annotations.
Comparing Gradient and Guided BackProp we find that on
average almost twice as many annotators are certain enough about
their prediction that they provide an annotation when assisted
with the Guided BackProp saliency map, compared to the Gradient
explanation that more often led annotators to choose the I don’t
know label. This finding highlights the importance of quantitatively
comparing transparency approaches. The extent to which human
users of ML can profit from transparency strongly depends on the
quality of the explanation provided.


Figure 5: Reaction times of all subjects for different interpretability
methods and mask sizes. Cognitive load related
to processing the explanation appears to peak around mask
sizes of 15% pixels and decays for larger masks once annotators
have detected the emotional expression shown.


Reaction times reflect cognitive load of interpretations. In Figure 5
we show the reaction times for each experimental condition. When 
most pixels are masked reaction times are low, as most subjects 
understand that they cannot make a correct prediction. For in- 
termediate mask sizes around 15% pixels shown, reaction times 
show a slight increase reflecting the increased cognitive load. For 
larger mask sizes, the reaction time decreases, as most subjects have 
provided an annotation already and keep clicking that label. 


Annotation accuracy distinguishes transparency methods. The 
most important metric for our purposes is the annotation accuracy 
for different interpretability methods. Higher quality explanations 
should lead to higher annotation accuracy. Indeed we find that 
annotation accuracy clearly distinguishes the three interpretability 
methods used in our experiments. In Figure 6 we show the annota- 
tion accuracy, averaged across subjects, for increasing mask sizes 
and all three different transparency approaches. Explanations using 
the plain gradient approach consistently led to the lowest annotation
accuracy. The layerwise relevance propagation approach (LRP)
[18] yielded slightly better annotation accuracies and annotators
assisted with the Guided BackProp explanations [31] were consistently
better than all other annotators. This effect was strongly
dependent on the mask size and most pronounced for intermediate
mask sizes around 15% of pixels shown. When more than 45% of pixels
were shown, subjects could detect the emotional expression reliably
in all conditions. These results suggest that annotation accuracy in
psychophysical experiments can serve as a robust quality indicator
for interpretability methods.


Figure 6: Annotation accuracies of all subjects for different
interpretability methods and mask sizes. For small mask
sizes, the annotation accuracy is as low as 20%; when 40% of
the pixels are masked, all subjects reliably detect the correct
emotional expression. For intermediate levels of masking,
there is a clear ranking of interpretability methods: guided
backprop achieves the highest accuracies.


4.2 Metrics with no humans in the loop 


In addition to the human in the loop experiments we also performed more
standard experiments in which we tested the three interpretability 
approaches under the same experimental conditions as in the psy- 
chophysical experiments. For each mask size and interpretability 
method we computed the predictions of the EmoPy [7] model and 
computed the accuracy across all images for each condition. The 
results are shown in Figure 7 and demonstrate that at around 30%
of all pixels the model achieves the highest performance, which is
better than the optimal performance on the test set. When masking
more pixels the prediction accuracy decreases irregularly without 
any specific trend, unlike in the case of the human annotators. 
An important difference to the human in the loop experiments 
is however that the model accuracy is not affected by the inter- 
pretability method as clearly as in the psychophysical experiments. 
Across all mask sizes there is no clear winner and in some cases 
the method that scored worst in the psychophysical experiments, 
Gradient, achieves the best accuracies when evaluating it on ML 
model predictions alone. 

Less important but interesting is also that while the general 
trend of lower accuracies with smaller masks is the same for both
humans and machines in our experiments, human cognition tends
to have a different sensitivity to the amount of pixels masked. While
humans achieve a lower performance than the ML model when
only 6% of pixels are shown, this effect is reversed when more than
45% of pixels are shown: annotators do not make mistakes in the
emotional expression classification task and the ML model only
achieves accuracies around 60% in those conditions.


Figure 7: Model prediction accuracies for different interpretability
methods and mask sizes. In contrast to human
annotators, there is no clear ranking of methods based on
the accuracy of the ML predictions.

Both of these findings demonstrate that human cognition and 
machine cognition share some properties but are very different in 
others. In particular interpretability metrics that are purely based 
on machine predictions do not seem to capture what makes an 
explanation useful for humans. 


4.3 Comparing NHIL and HIL metrics 


While the previous sections focussed on each metric individually 
we also compared the metrics obtained in psychophysical experi- 
ments with humans in the loop, see subsection 4.1, with the metrics 
obtained by conventional offline no human in the loop (NHIL) ap- 
proaches, see subsection 4.2. For the comparison we paired the 
machine based NHIL metrics with the psychophysical human in 
the loop (HIL) metrics by grouping the data by image, mask size and 
interpretability method. For the HIL metrics we then computed the 
average accuracy across all subjects for each image and ranked all 
three interpretability methods according to the average annotation 
accuracy achieved with a given explanation assistance across all 
subjects for a given image. For the NHIL metrics we ranked the 
methods according to the cross-entropy loss incurred by a predic- 
tion for each interpretability method, mask size and image. As the 
cross-entropy loss is a continuous loss, in contrast to accuracy per 
data point, this allowed us to rank the methods for each image despite
the fact that there was only one prediction per image. The inter- 
pretability method rankings for the psychophysics and machine 
based metrics are shown in Figure 8. In the left panel the aggregated 
ranks across all mask sizes show that despite the transformation of 
the metrics into ranks, the two types of metrics are not very similar. 




[Figure 8, three panels. Left: rank by ranker (AI, Human). Middle: Ranker = AI. Right: Ranker = Human. Middle and right panels plot rank against % shown pixels. Legend: gradient, lrp, guided_backprop.]

Figure 8: Comparison of interpretability method rankings obtained in psychophysical experiments with human annotators 
and in experiments based on ML predictions with no humans in the loop. Ranks were computed for each image and then aver- 
aged across all images. For each image human ranks were based on annotation accuracy averaged across all subjects; AI ranks 
were based on cross-entropy loss per image. Left panel: Rank for each interpretability method, averaged across all mask sizes. 
AI prediction based rankings show no clear differentiation of interpretability quality while rankings based on psychophysics 
show that Gradient based explanations are consistently worst and Guided BackProp explanations are consistently best. Middle 
and right panel: Ranks for each interpretability method computed on AI predictions and human annotators for each mask 
size. For most mask sizes human annotators’ accuracy was significantly higher for the Guided BackProp approach; there is no
clear winner for the AI interpretability quality metric.


Interpretability metrics obtained in the psychophysical experiments
show a clear and robust ranking, while the rankings obtained from
ML predictions do not allow us to distinguish the three methods in
terms of their interpretability. This is also reflected in the two right
panels, which show the same data as in Figure 8 (left), but split by
mask size condition. The average ranks of the psychophysical
experiments show the same clear pattern as the aggregate metrics:
Guided BackProp is better than LRP, which is in turn better than the
plain Gradient explanation. In contrast, it is difficult to single out the
best interpretable explanation based on the machine based NHIL
metrics; there is no significant difference between the methods for
most thresholds, yet there seems to be some advantage for Guided
BackProp for mask sizes of 19% and 26%. Note that this trend is not
reflected in the results of the psychophysical experiments.

Overall these comparisons demonstrate that machine based
interpretability metrics not only fail to provide a clear comparison of
interpretability methods; more importantly, these metrics are also not
representative of what is relevant for interpretability by humans.
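The rank aggregation described above (ranks computed per image, then averaged across images) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy numbers, the convention that rank 1 is best, and the use of average ranks for ties are all assumptions made for the example. Human ranks use per-image annotation accuracy (higher is better); AI ranks use the per-image cross-entropy loss, computed here as the negative log of a hypothetical predicted probability for the true class (lower is better).

```python
import math
from statistics import mean


def rank_methods(scores, higher_is_better):
    """Rank methods for one image; rank 1 = best (assumed convention).

    Tied methods share the average of the ranks they span.
    """
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over all methods tied with the method at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for method in ordered[i:j + 1]:
            ranks[method] = shared_rank
        i = j + 1
    return ranks


def mean_ranks(per_image_scores, higher_is_better):
    """Compute ranks per image, then average them across images."""
    per_image = [rank_methods(s, higher_is_better) for s in per_image_scores]
    return {m: mean(r[m] for r in per_image) for m in per_image[0]}


# Toy data (invented): annotation accuracy per image, averaged across subjects.
human_accuracy = [
    {"gradient": 0.50, "lrp": 0.70, "guided_backprop": 0.90},
    {"gradient": 0.40, "lrp": 0.80, "guided_backprop": 0.80},
]

# Toy data (invented): model probability for the true class on the masked image.
ai_true_class_prob = [
    {"gradient": 0.4, "lrp": 0.9, "guided_backprop": 0.6},
    {"gradient": 0.9, "lrp": 0.3, "guided_backprop": 0.6},
]
# Cross-entropy loss for a single prediction: -log p(true class).
ai_loss = [{m: -math.log(p) for m, p in img.items()} for img in ai_true_class_prob]

human_ranks = mean_ranks(human_accuracy, higher_is_better=True)
ai_ranks = mean_ranks(ai_loss, higher_is_better=False)
print(human_ranks)
print(ai_ranks)
```

In this toy example the human-based ranks separate the three methods, while the loss-based ranks cancel out across images. That mirrors the qualitative pattern reported in Figure 8, but the numbers themselves carry no empirical meaning.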


4.4 Transparency and algorithmic bias 


The above results demonstrate the impact of interpretability on 
annotation and prediction accuracy, but they miss an important 
aspect of transparent ML methods: human bias to algorithmic deci- 
sions. When explanations are intuitive humans tend to replicate the 
predictions of algorithms [28], also in cases when the ML prediction 
is wrong. Such negative effects of transparency can be detrimental 
in real world applications. Hence measuring these effects helps to 
calibrate human-AI collaboration for more responsible and efficient
usage of assistive AI technology. In Figure 9 we show the overlap of 
human annotators’ predictions with the ML predictions, averaged 
across all images. Importantly, we only consider cases in which
the ML model was wrong, as we are interested in the negative
aspects of algorithmic bias. For the mask sizes for which we see a
clear advantage of the Guided BackProp method, from 6% to 19%
of pixels shown (Figure 8), we also see a trend for larger algorithmic
bias with explanations computed with the Guided BackProp
method (Figure 9). Annotators provided the same (wrong) answer
as the model more often when they were exposed to the Guided
BackProp explanation compared to other explanations. This shows
that more intuitive explanations not only lead to the increased
annotation accuracy we have seen in Figure 8, but also to the negative
form of algorithmic bias when annotators wrongly replicate a
model’s prediction.

Figure 9: Annotators’ algorithmic bias measured as the overlap
between human annotators and the ML prediction when
the ML prediction was incorrect. The interpretability method
Guided BackProp was most helpful in terms of annotation
accuracy (see Figure 6), but at the same time it also appeared
to lead annotators to follow the model prediction when it
was wrong.
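The overlap measure used here can be sketched as the fraction of annotator answers that match the model prediction, restricted to trials in which the model was wrong. The sketch below rests on assumptions: the trial record layout, the field names and the toy labels are hypothetical, and the per-image averaging used in the study is omitted for brevity.

```python
def algorithmic_bias(trials):
    """Fraction of human answers equal to the model's answer,
    counted only over trials where the model prediction was wrong.

    Returns None when the model was never wrong, since the overlap
    is undefined in that case.
    """
    wrong = [t for t in trials if t["model_label"] != t["true_label"]]
    if not wrong:
        return None
    matches = sum(t["human_label"] == t["model_label"] for t in wrong)
    return matches / len(wrong)


# Toy trials (invented) for one interpretability method and mask size.
trials = [
    {"true_label": "cat", "model_label": "dog", "human_label": "dog"},
    {"true_label": "cat", "model_label": "dog", "human_label": "cat"},
    {"true_label": "dog", "model_label": "dog", "human_label": "dog"},  # model correct: ignored
    {"true_label": "bird", "model_label": "cat", "human_label": "cat"},
]

bias = algorithmic_bias(trials)
print(bias)
```

Computing this value separately for each interpretability method and mask size would give one point per condition. A higher value means annotators replicated the model's wrong answer more often.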


5 CONCLUSION 


Methods that increase transparency of ML systems have become a 
major focus of research. Despite substantial advancements in the 
field and a plethora of methods available for rendering ML model 
predictions more interpretable, there appears to be no gold standard 
evaluation method for interpretability quality [9]. Reliable and 
quantitative measures for evaluating interpretability are however a 
fundamental prerequisite for designing and improving transparent 
ML systems. 

Many studies use interpretability evaluations that rely on ML 
predictions only, without humans in the loop [27]. This approach 
has the advantage that it is scalable and does not suffer from often 
subjective human judgements. But these measures are not directly
related to the quantity of interest: how interpretable an explanation
is for a human observer. Other studies evaluate interpretability in
experiments with humans in the loop [14, 17, 22, 26, 28]. But these
approaches do not follow the same experimental design, which makes
comparisons across studies difficult. To the best of our knowledge
there are few studies that use the same experimental conditions 
for humans and machines when evaluating interpretability meth- 
ods and that relate results from human in the loop experiments to 
evaluations without humans. 

In this study we used psychophysical experiments with humans
to evaluate the quality of explanations for ML predictions. We
compared those quality metrics with the metrics obtained in
experiments without humans in the loop. Our results demonstrate
that while psychophysical experiments allow us to derive robust and
clear rankings of interpretability quality, interpretability metrics
obtained with ML predictions alone do not show a clear ranking of
interpretability methods. More importantly, our results also show
that the metrics computed without humans in the loop are not only
unstable, they are also not representative of the rankings obtained
in psychophysical experiments. These results highlight the potential
of standardized psychophysical tests for the evaluation of ML
methods and indicate that evaluations of interpretability should
not rely exclusively on experiments without humans in the loop.